High Performing Players

Rationale for Variable Selection

To find the best players, five variables were taken into account: games started (GS), field goals (FG), total rebounds (TRB), assists (AST), and total points for the season (PTS). The number of games a player started is important because it shows how much the player's team trusts him in its starting lineup. The number of field goals is important because it counts every shot the player made, including both 2-pointers and 3-pointers. Total rebounds are essential because rebounding helps determine which team gains possession of the ball after a missed shot. The number of assists is critical because an assist is credited to the player who makes the last pass before a teammate scores. Lastly, the number of points a player scored over the season is important because it shows how consistently the player turned plays into scores.

Details of the Approach

This study will determine which professional players are high-performing athletes with low salaries. To make this determination reliable, the player statistics are first merged with the salary data and normalized. The selected variables are then run through the k-means algorithm, and the NbClust package is used to determine the number of clusters. Finally, the clusters are plotted and the results are validated.

#### Merging the Datasets

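# Convert both tables to data frames and merge them
nba <- as.data.frame(nba)
nba_sal <- as.data.frame(nba_sal)
nba <- merge(nba, nba_sal)  # no `by` given, so merge() joins on all shared column names

# Normalize the clustering variables before subsetting.
# normalize() is not defined in this section; it is assumed to be a helper defined earlier.
# Note: FG is z-scored with scale() while the other columns use normalize(),
# so FG ends up on a different scale from the rest.
nba$GS <- normalize(nba$GS)
nba$FG <- as.numeric(scale(nba$FG, center = TRUE, scale = TRUE))  # as.numeric() drops the matrix wrapper
nba$TRB <- normalize(nba$TRB)
nba$AST <- normalize(nba$AST)
nba$PTS <- normalize(nba$PTS)

# Subset the data to the selected variables
clust_data <- nba[, c("GS", "FG", "TRB", "AST", "PTS")]
View(clust_data)
View(nba)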
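
The `normalize()` helper used above is not defined in this section. The sketch below assumes it performs min-max scaling, which would be consistent with the 0-to-1 range of the GS, TRB, AST, and PTS centers in the k-means output. The function body and the `pts` values are illustrative assumptions, not the report's own code. It also shows how `scale()` differs: z-scores are centered at 0 and can be negative, which is why FG sits on a different scale from the other columns.

```r
# Hypothetical min-max helper; the report's own normalize() is not shown.
normalize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column would divide by zero, so map it to 0 instead
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

pts <- c(120, 450, 980, 1500)      # made-up season point totals

pts_minmax <- normalize(pts)       # rescaled to the interval [0, 1]
pts_z <- as.numeric(scale(pts))    # z-scores: mean 0, standard deviation 1

print(round(pts_minmax, 3))
print(round(pts_z, 3))
```

Because k-means relies on Euclidean distance, a column with a wider spread contributes more to the distance, so using one scaling method for all five variables may be worth considering.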

Run the K-means algorithm

# Run k-means with 2 centers; set.seed() makes the random starting centers reproducible
set.seed(1)
kmeans_obj_nba = kmeans(clust_data, centers = 2, algorithm = "Lloyd")
head(kmeans_obj_nba)
## $cluster
##   [1] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 1 1 2 1 1 2
##  [38] 2 1 2 2 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 2 1 2 1 1 1 2 2 1 1 1 2 1 1 2 1 1
##  [75] 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 1
## [112] 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 2 1 2 2 2 1 2 1 1
## [149] 1 1 2 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [186] 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 1
## [223] 2 2 1 2 2 1 2 2 2 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 2 2
## [260] 1 1 1 1 2 1 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 2 2 1 2 1 1 1 2 2 1 1 1 1 1 1
## [297] 1 2 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1
## [334] 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1
## [371] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 2 2 1 2 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1
## [445] 1 1 1 1 1 2 1 1 1 2 1 2
## 
## $centers
##          GS        FG       TRB        AST       PTS
## 1 0.1646724 -0.525990 0.1476800 0.09762642 0.1393395
## 2 0.7061573  1.277404 0.3925738 0.34335840 0.5222029
## 
## $totss
## [1] 562.043
## 
## $withinss
## [1]  99.39642 103.48878
## 
## $tot.withinss
## [1] 202.8852
## 
## $betweenss
## [1] 359.1578
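
One way to read these sums of squares is as the share of total variance explained by the clustering, `betweenss / totss`. The sketch below is only an illustration: it copies the values from the output above rather than reading them from `kmeans_obj_nba`.

```r
# Values copied from the kmeans() output above
totss <- 562.043
withinss <- c(99.39642, 103.48878)
betweenss <- 359.1578

# Total within-cluster sum of squares is the sum over both clusters
tot_withinss <- sum(withinss)

# Share of the total variance captured by the split into two clusters
explained <- betweenss / totss

print(round(tot_withinss, 4))
print(round(explained, 3))
```

With the fitted object available, the same ratio is `kmeans_obj_nba$betweenss / kmeans_obj_nba$totss`. Here it comes to roughly 0.64, meaning the two clusters account for about 64% of the total variance.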

Visualizing Various Plots and Correlations

#### Correlation between the Games Started and Points

# Scatter plot of games started against points; point shape shows the cluster
# (both axes show the normalized values stored in nba)
sal_clusters <- as.factor(kmeans_obj_nba$cluster)
p_pts <- ggplot(nba, aes(x = GS, y = PTS, shape = sal_clusters, color = "2020-21", text = Player)) +
  geom_point(size = 6) +
  ggtitle("Games Started vs. Points for NBA Players") +
  xlab("Games Started (normalized)") +
  ylab("Points (normalized)") +
  scale_shape_manual(name = "Cluster", labels = c("Cluster 1", "Cluster 2"), values = c("1", "2")) +
  theme_light()
# The text aesthetic is only used by plotly to show the player name on hover
ggplotly(p_pts, tooltip = "text")

#### Correlation between the Games Started and Total Rebounds

# Scatter plot of games started against total rebounds; point shape shows the cluster
sal_clusters <- as.factor(kmeans_obj_nba$cluster)
# Named p_trb rather than c to avoid masking base R's c()
p_trb <- ggplot(nba, aes(x = GS, y = TRB, shape = sal_clusters, color = "2020-21", text = Player)) +
  geom_point(size = 6) +
  ggtitle("Games Started vs. Total Rebounds for NBA Players") +
  xlab("Games Started (normalized)") +
  ylab("Total Rebounds (normalized)") +
  scale_shape_manual(name = "Cluster", labels = c("Cluster 1", "Cluster 2"), values = c("1", "2")) +
  theme_light()
ggplotly(p_trb, tooltip = "text")

#### Correlation between the Games Started and Field Goals

# Scatter plot of games started against field goals; point shape shows the cluster
sal_clusters <- as.factor(kmeans_obj_nba$cluster)
p_fg <- ggplot(nba, aes(x = GS, y = FG, shape = sal_clusters, color = "2020-21", text = Player)) +
  geom_point(size = 6) +
  ggtitle("Games Started vs. Field Goals for NBA Players") +
  xlab("Games Started (normalized)") +
  ylab("Field Goals (normalized)") +
  scale_shape_manual(name = "Cluster", labels = c("Cluster 1", "Cluster 2"), values = c("1", "2")) +
  theme_light()
ggplotly(p_fg, tooltip = "text")

Finding the ideal number of clusters

# Use NbClust() to compare its indices and find the recommended number of clusters
(nbclust_obj_nba = NbClust(data = clust_data, method= "kmeans"))

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 6 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 2 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
## $All.index
##         KL       CH Hartigan     CCC    Scott   Marriot    TrCovW   TraceW
## 2   3.0631 803.6941 378.8952  6.9172 2758.852 112045.62 3423.1353 202.8852
## 3   2.2640 924.6234 222.2366 14.8767 3136.610 110103.39  893.7890 110.5900
## 4   2.3614 990.7091 122.0779 17.6022 3444.050  99740.56  300.5948  74.1922
## 5   1.9726 972.0767  77.4164 17.9587 3681.684  92548.54  206.5897  58.4152
## 6   1.1996 924.5795  67.0902 17.5206 3962.401  72007.44  149.0231  49.8570
## 7  20.5719 894.5432  26.4921 17.4593 4087.379  74514.71  109.3953  43.3882
## 8   0.0451 813.9592  66.2986 15.9640 4192.734  77247.73   93.1312  40.9709
## 9  75.5983 824.0572  18.6549 18.5210 4359.232  67860.60   68.7249  35.6893
## 10  0.0489 763.4260  29.5501 17.6752 4436.504  70719.51   63.8859  34.2595
## 11  1.5107 733.9124  23.9469 17.7246 4583.319  62015.10   55.2905  32.1307
## 12  1.6571 703.6891   9.5826 17.5989 4630.069  66611.65   48.9329  30.4899
## 13  0.4260 658.2826  24.9826 16.7159 4650.373  74771.69   47.0832  29.8458
## 14  0.2977 642.3814  55.0067 16.9179 4744.486  70546.12   42.3623  28.2525
## 15  7.4899 673.1335  17.2746 19.0006 4890.092  58847.00   30.5451  25.1256
##    Friedman   Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2  253.1478  3.3181 0.2167 0.7359     0.5699 0.7740 115.0449  0.9110    0.5031
## 3  282.4617  6.0873 0.2184 0.7535     0.5139 1.0477 -14.9314 -0.1419    0.4689
## 4  307.5183  9.0736 0.2242 0.8201     0.4716 1.3500 -45.6268 -0.8070    0.4176
## 5  314.6520 11.5243 0.1958 0.9159     0.4412 0.9844   1.1435  0.0489    0.3831
## 6  321.6205 13.5025 0.1970 0.9357     0.4452 1.7746 -37.1015 -1.3358    0.3581
## 7  323.5198 15.5156 0.1991 0.9791     0.4348 2.0499 -92.7042 -1.5917    0.3334
## 8  326.5670 16.4310 0.1902 1.0554     0.4180 0.9502  10.1228  0.1630    0.3128
## 9  334.6519 18.8626 0.1673 1.0195     0.4090 1.0931  -4.0034 -0.2583    0.2967
## 10 340.7736 19.6498 0.1628 1.0640     0.3987 1.9732 -73.4876 -1.5303    0.2826
## 11 343.5793 20.9517 0.1542 1.0698     0.3969 0.8568  23.7239  0.5184    0.2704
## 12 346.1570 22.0792 0.1493 1.0644     0.3625 2.3743 -63.6704 -1.7787    0.2595
## 13 349.5595 22.5557 0.1485 1.0936     0.3614 1.9891 -39.2844 -1.5295    0.2500
## 14 358.4058 23.8277 0.1421 1.1314     0.3525 2.4180 -48.0875 -1.8027    0.2408
## 15 369.1553 26.7931 0.1895 1.1076     0.3408 2.5750 -42.2039 -1.8813    0.2343
##        Ball Ptbiserial   Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  101.4426     0.6764 1.2708  0.2556 0.0208 0.0023  7.7445 0.5796 0.8518
## 3   36.8633     0.6497 1.5072  0.4279 0.0319 0.0028  5.6287 0.4358 0.4579
## 4   18.5480     0.5845 1.3564  0.6085 0.0295 0.0030  5.0875 0.3584 0.4513
## 5   11.6830     0.5243 0.6694  0.7960 0.0145 0.0032  5.7745 0.3065 0.3784
## 6    8.3095     0.5117 0.5801  0.8310 0.0160 0.0032  6.5039 0.2880 0.2871
## 7    6.1983     0.5028 1.0068  0.8484 0.0175 0.0032  6.0277 0.2715 0.1598
## 8    5.1214     0.4911 1.4461  0.8832 0.0250 0.0032  7.6111 0.2639 0.2649
## 9    3.9655     0.4369 0.6860  1.0943 0.0272 0.0033  7.6024 0.2371 0.1020
## 10   3.4260     0.4321 0.6208  1.1081 0.0324 0.0033  9.1923 0.2319 0.0979
## 11   2.9210     0.4236 4.8170  1.1279 0.0324 0.0033  9.2106 0.2246 0.1011
## 12   2.5408     0.3707 4.2427  1.4869 0.0217 0.0034 11.9838 0.2123 0.1073
## 13   2.2958     0.3624 0.3658  1.5567 0.0177 0.0034 13.8035 0.2089 0.0907
## 14   2.0180     0.3582 0.5465  1.5593 0.0177 0.0034 13.5507 0.2046 0.0840
## 15   1.6750     0.3545 0.2330  1.5747 0.0242 0.0034 17.1298 0.1969 0.0644
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.7696           117.9564       0.4728
## 3          0.7546           106.6898       1.0000
## 4          0.7365            62.9768       1.0000
## 5          0.6374            40.9588       0.9985
## 6          0.5995            56.7775       1.0000
## 7          0.7172            71.3659       1.0000
## 8          0.7183            75.6915       0.9760
## 9          0.5502            38.4256       1.0000
## 10         0.7014            63.4302       1.0000
## 11         0.7007            60.6631       0.7625
## 12         0.6251            65.9670       1.0000
## 13         0.6315            46.1003       1.0000
## 14         0.6273            48.7193       1.0000
## 15         0.6315            40.2648       1.0000
## 
## $Best.nc
##                      KL       CH Hartigan     CCC    Scott  Marriot   TrCovW
## Number_clusters  9.0000   4.0000   3.0000 15.0000   3.0000     6.00    3.000
## Value_Index     75.5983 990.7091 156.6586 19.0006 377.7579 23048.36 2529.346
##                  TraceW Friedman   Rubin  Cindex     DB Silhouette  Duda
## Number_clusters  3.0000   3.0000  9.0000 14.0000 2.0000     2.0000 2.000
## Value_Index     55.8974  29.3138 -1.6444  0.1421 0.7359     0.5699 0.774
##                 PseudoT2 Beale Ratkowsky    Ball PtBiserial   Frey McClain
## Number_clusters   2.0000 2.000    2.0000  3.0000     2.0000 4.0000  2.0000
## Value_Index     115.0449 0.911    0.5031 64.5793     0.6764 1.3564  0.2556
##                    Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 10.0000      0  4.0000      0 15.0000
## Value_Index      0.0324      0  5.0875      0  0.0644
## 
## $Best.partition
##   [1] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 1 1 2 1 1 2
##  [38] 2 1 2 2 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 2 1 2 1 1 1 2 2 1 1 1 2 1 1 2 1 1
##  [75] 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 1
## [112] 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 2 1 2 2 2 1 2 1 1
## [149] 1 1 2 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [186] 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 1
## [223] 2 2 1 2 2 1 2 2 2 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 2 2
## [260] 1 1 1 1 2 1 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 2 2 1 2 1 1 1 2 2 1 1 1 1 1 1
## [297] 1 2 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1
## [334] 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1
## [371] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 2 2 1 2 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1
## [445] 1 1 1 1 1 2 1 1 1 2 1 2
# Take the first row of Best.nc (the number of clusters each index recommends) and convert it to a data frame
freq_k_nba <- nbclust_obj_nba$Best.nc[1, ]
freq_k_nba <- data.frame(freq_k_nba)

# Plot the recommended number of clusters as a bar chart of votes
ggplot(freq_k_nba, aes(x = freq_k_nba)) +
  geom_bar() +
  scale_x_continuous(breaks = seq(0, 15, by = 1)) +
  scale_y_continuous(breaks = seq(0, 12, by = 1)) +
  labs(x = "Number of Clusters", y = "Number of Votes", title = "Cluster Analysis")

From the cluster analysis, the recommended number of clusters is 2: of the 24 indices that cast a vote, 8 proposed 2 clusters, ahead of 6 votes for 3 clusters.
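
The majority vote can also be tallied directly from the first row of `$Best.nc`. The sketch below copies that row from the output above instead of reading it from `nbclust_obj_nba`. It drops the two graphical indices (Hubert and Dindex), which NbClust reports as 0 and which do not vote.

```r
# First row of $Best.nc, copied from the NbClust output above
best_k <- c(
  KL = 9, CH = 4, Hartigan = 3, CCC = 15, Scott = 3, Marriot = 6,
  TrCovW = 3, TraceW = 3, Friedman = 3, Rubin = 9, Cindex = 14, DB = 2,
  Silhouette = 2, Duda = 2, PseudoT2 = 2, Beale = 2, Ratkowsky = 2,
  Ball = 3, PtBiserial = 2, Frey = 4, McClain = 2, Dunn = 10,
  Hubert = 0, SDindex = 4, Dindex = 0, SDbw = 15
)

# Count votes per candidate number of clusters, ignoring the 0 entries
votes <- table(best_k[best_k > 0])

# The winner is the number of clusters with the most votes
winner <- as.integer(names(votes)[which.max(votes)])

print(votes)
print(winner)
```

With the fitted object, `best_k` would be `nbclust_obj_nba$Best.nc[1, ]`. The bar chart above plots the same row without this filter, so it also shows a bar at 0 for the two graphical indices.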

Final Recommendations

I would recommend Michael Porter Jr., Norman Powell, and PJ Washington. I recommend these three players because they did well last season in both the number of field goals they made and the number of points they accumulated. They also appear to be high-performing players who are paid less than other athletes.
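
A shortlist like this could be drawn from the cluster assignments along the following lines. This is a minimal sketch under stated assumptions, not the procedure used for the recommendation above: the rows are made up, `Salary` is a hypothetical column name, and the median-salary cutoff is an arbitrary choice for illustration.

```r
# Made-up rows for illustration only; Salary is a hypothetical column name
players <- data.frame(
  Player  = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  cluster = c(2, 2, 1, 2, 1),
  Salary  = c(2.1e6, 30e6, 1.5e6, 4.0e6, 12e6),
  stringsAsFactors = FALSE
)

# Cluster 2 has the higher center on every variable in the k-means output
high_cluster <- 2

# Assumed cutoff: at or below the median salary counts as "low salary"
cutoff <- median(players$Salary)

# Keep high-performing, low-salary players, cheapest first
shortlist <- subset(players, cluster == high_cluster & Salary <= cutoff)
shortlist <- shortlist[order(shortlist$Salary), "Player"]

print(shortlist)
```

On the real data, the cluster labels could be attached with `nba$cluster <- kmeans_obj_nba$cluster`, and `Salary` would need to be replaced with the actual salary column from `nba_sal`.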